October 01, 2020

Reading Data

The chunk below is to read all the given data and process it so that I have a TOTALKBTU column.

years <- 2017:2020
quarters <- 1:4
types <- c("Electric", "Gas")

pge_17_20_elec_gas <- NULL

for(year in years){
  for(quarter in quarters){
    if ((quarter %in% 3:4) && (year %in% 2020)){
      next}
  for(type in types){
    
    filename <-
      paste0("PGE_", year, "_Q", quarter, "_", type, "UsageByZip.csv")

    temp <- read_csv(filename)
    
    if(type %in% "Gas"){
      temp <- temp %>% 
        mutate(TOTALKBTU = 100 * TOTALTHM) %>% 
        select(-TOTALTHM, -AVERAGETHM)
    } else {
      temp <- temp %>% 
        mutate(TOTALKBTU = 3.412 * TOTALKWH) %>% 
        select(-TOTALKWH, -AVERAGEKWH)
    }
  
    pge_17_20_elec_gas <- rbind(pge_17_20_elec_gas,temp)
  }
  }
}

Filtering Zipcode

This chunk below is to set up the zipcode filter so that my data only includes relevant information from the 9 bay-area counties.

ca_counties <- counties("CA", cb = T, progress_bar = F)

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_counties <-
  ca_counties %>%
  filter(NAME %in% bay_county_names)

usa_zips <- 
  zctas(cb = T, progress_bar = F)

bay_zips <-
  usa_zips %>% 
  st_centroid() %>% 
  .[bay_counties, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(usa_zips %>% select(GEOID10)) %>% 
  st_as_sf()

Plot

I summarize and total the data in the 4 relevant fields (residential and commercial electricity and gas) from the bay area counties in a column named TOTALKBTU. I sum it all up and group by CUSTOMERCLASS and DATE parameters. I also separated the data into residential and commercial to better visualize the trends between each sector.

pge_residential <-
  pge_17_20_elec_gas %>% 
  mutate(
    DATE = paste(YEAR, MONTH, "01", sep = "-") %>% 
      as.Date()) %>% 
  filter(CUSTOMERCLASS %in%
           c(
             "Elec- Residential",
             "Gas- Residential")
  ) %>% 
  mutate(
    ZIPCODE = ZIPCODE %>% as.character()
  ) %>% 
  group_by(ZIPCODE) %>% 
  right_join(
    bay_zips %>% select(GEOID10),
    by = c("ZIPCODE" = "GEOID10")
  ) %>% 
  group_by(CUSTOMERCLASS,
           DATE) %>% 
  summarize(
    TOTALKBTU = sum(TOTALKBTU, na.rm = T)
  ) 

pge_residential
## # A tibble: 85 x 3
## # Groups:   CUSTOMERCLASS [3]
##    CUSTOMERCLASS     DATE         TOTALKBTU
##    <chr>             <date>           <dbl>
##  1 Elec- Residential 2017-01-01 4803453360.
##  2 Elec- Residential 2017-02-01 3821499662.
##  3 Elec- Residential 2017-03-01 3697116352.
##  4 Elec- Residential 2017-04-01 3383495869.
##  5 Elec- Residential 2017-05-01 3553549775.
##  6 Elec- Residential 2017-06-01 3938630354.
##  7 Elec- Residential 2017-07-01 4343628291.
##  8 Elec- Residential 2017-08-01 4209745762.
##  9 Elec- Residential 2017-09-01 8346356825.
## 10 Elec- Residential 2017-10-01 3569080069.
## # … with 75 more rows
pge_commercial <-
  pge_17_20_elec_gas %>% 
  mutate(
    DATE = paste(YEAR, MONTH, "01", sep = "-") %>% 
      as.Date()) %>% 
  filter(CUSTOMERCLASS %in%
           c(
             "Elec- Commercial",
             "Gas- Commercial")
  ) %>% 
  mutate(
    ZIPCODE = ZIPCODE %>% as.character()
  ) %>% 
  group_by(ZIPCODE) %>% 
  right_join(
    bay_zips %>% select(GEOID10),
    by = c("ZIPCODE" = "GEOID10")
  ) %>% 
  group_by(CUSTOMERCLASS,
           DATE) %>% 
  summarize(
    TOTALKBTU = sum(TOTALKBTU, na.rm = T)
  ) 

pge_commercial
## # A tibble: 85 x 3
## # Groups:   CUSTOMERCLASS [3]
##    CUSTOMERCLASS    DATE          TOTALKBTU
##    <chr>            <date>            <dbl>
##  1 Elec- Commercial 2017-01-01  5632166741.
##  2 Elec- Commercial 2017-02-01  4687272512.
##  3 Elec- Commercial 2017-03-01  4906689831.
##  4 Elec- Commercial 2017-04-01  4800958796.
##  5 Elec- Commercial 2017-05-01  5139406247.
##  6 Elec- Commercial 2017-06-01  5223363399.
##  7 Elec- Commercial 2017-07-01  5582878599.
##  8 Elec- Commercial 2017-08-01  5786284979.
##  9 Elec- Commercial 2017-09-01 11259448674.
## 10 Elec- Commercial 2017-10-01  6488648461.
## # … with 75 more rows
chart_residential <-
  pge_residential %>% 
  na.omit() %>% 
  ggplot() +
  geom_bar(
    aes(
      x = DATE,
      y = TOTALKBTU,
      fill = CUSTOMERCLASS
    ),
    stat = "identity",
    position = "dodge"

  ) +
  labs(
    x = "Date",
    y = "KBTU",
    title = "PG&E Bay Area Total Monthly Residential<br>Utilities Usage in KBTU, 2017-2020",
    fill = "Types of<br>Energy Usage"
  )

chart_commercial <-
  pge_commercial %>% 
  na.omit() %>% 
  ggplot() +
  geom_bar(
    aes(
      x = DATE,
      y = TOTALKBTU,
      fill = CUSTOMERCLASS
    ),
    stat = "identity",
    position = "dodge"

  ) +
  labs(
    x = "Date",
    y = "KBTU",
    title = "PG&E Bay Area Total Monthly Commercial<br>Utilities Usage in KBTU, 2017-2020",
    fill = "Types of<br>Energy Usage"
  )

As for noticeable trends within the residential sector, we can see that there has definitely been a large increase in gas usage since the months of COVID 19. As opposed to the usual decrease after the winter months, March-June of 2020 has maintained a high level of energy usage from the gas department. As for electricity, it’s a little bit harder to notice in this graph as the frame of reference on the y-axis is shared with gas. The total magnitude of gas is quite a lot higher than that of electricity, but upon closer inspection, you can notice that there is a slight increase in electricity usage in residential homes. We will better visualize it in the geospatial map shown later in the assignment.

chart_residential %>% 
  ggplotly() %>% 
  layout(
    xaxis = list(fixedrange = T),
    yaxis = list(fixedrange = T)
  ) %>%
  config(displayModeBar = F)

As for the commercial sector, there is definitely a noticeable trend in the decrease of commercial electricity usage since COVID 19. This makes sense, as many stores and shops closed down or simply put their business on hold. As for gas, it is interesting that there is also an increased usage since COVID 19. I cannot understand why there might be an increase, but it might simply be due to some areas in the data perhaps referencing commercial sectors but are actually residential homes. If this is not the case, one reason I can think of is that gas is being used in the commercial sector to power the heating of water (which is needed for the increase of household water and hot water for residential homes).

chart_commercial %>% 
  ggplotly() %>% 
  layout(
    xaxis = list(fixedrange = T),
    yaxis = list(fixedrange = T)
  ) %>%
  config(displayModeBar = F)

Comparing Pre- and Post- COVID Data

For this section, I first made a variable to include all the relevant data from the bay area and I created a column called AVERAGEKBTU so that we can better understand the overall energy use for each zipcode.

pge_all <-
  pge_17_20_elec_gas %>% 
  mutate(
    ZIPCODE = ZIPCODE %>% as.character()
  ) %>% 
  group_by(ZIPCODE) %>% 
  right_join(
    bay_zips %>% select(GEOID10),
    by = c("ZIPCODE" = "GEOID10")
  ) %>% 
  filter( 
         CUSTOMERCLASS %in% c(
        "Elec- Residential",
        "Elec- Commercial",
        "Gas- Residential",
        "Gas- Commercial")
        ) %>% 
  select(
    -COMBINED) %>% 
  mutate(
    AVERAGEKBTU = TOTALKBTU / TOTALCUSTOMERS) %>%  
  mutate(
    DATE = paste(YEAR, paste0("0",MONTH), "01", sep = "-") %>% 
      as.Date()
    )%>% 
  select(-MONTH, -YEAR)

To visualize the neighbourhoods that had the greatest electricity change due to COVID, I had to filter the data so that it included the months of COVID and the months without COVID. However, I didn’t just choose the months of January-March as PRE-COVID because the frame of reference had to be consistent. The winter months of January-March, as can be seen from the chart before, has generally higher electricity consumption due to colder temperatures (electric heaters). As a result, I chose the months of March 1st to June 1st of 2019 and set that as PRE-COVID while I chose March 1st to June 1st for this year and set that as POST-COVID. I did not choose 2018 and 2017 because I thought that those years would be too far off that people’s habits have changed since then. I also only used residential electricity usage because I wanted to visualize who would likely be directly impacted by the increase in electricity bills due to residential electricity consumption increases. One thing to note is that between the two years, the ZIPCODE results are different. The two dataframes had varying entry lengths. When I looked through the data, the older dataset had some zipcodes that weren’t used in the new one and vice versa. I could imagine this is due to the development of the area and the usage of some new zipcodes while data could have perhaps been missing for the others due to mishandling. Either way, some of the zipcodes did not match up so I omitted ones with values that were not available and I believe that cleared up all the values that were missing, thereby matching the available datasets together by zipcode.

pge_pre_covid <-
  filter(pge_all,
         DATE >= as.Date("2019-03-01") & DATE <= as.Date("2019-06-01"),
         CUSTOMERCLASS == "Elec- Residential"
         )%>% 
  na.omit() %>% 
  group_by(ZIPCODE,
           DATE) %>% 
  summarize(TOTALPRE = sum(TOTALKBTU)) 

pge_pre_covid
## # A tibble: 1,068 x 3
## # Groups:   ZIPCODE [267]
##    ZIPCODE DATE        TOTALPRE
##    <chr>   <date>         <dbl>
##  1 94002   2019-03-01 15670057.
##  2 94002   2019-04-01 13092936.
##  3 94002   2019-05-01 13256889.
##  4 94002   2019-06-01 12921855.
##  5 94005   2019-03-01  2669122.
##  6 94005   2019-04-01  2195329.
##  7 94005   2019-05-01  2207762.
##  8 94005   2019-06-01  2089263.
##  9 94010   2019-03-01 34121552.
## 10 94010   2019-04-01 28494011.
## # … with 1,058 more rows
pge_post_covid <-
  filter(pge_all,
         DATE >= as.Date("2020-03-01") & DATE <= as.Date("2020-06-01"),
         CUSTOMERCLASS == "Elec- Residential"
         )%>% 
  na.omit() %>% 
  group_by(ZIPCODE,
           DATE) %>% 
  summarize(TOTALPOST = sum(TOTALKBTU)) 

pge_post_covid
## # A tibble: 1,068 x 3
## # Groups:   ZIPCODE [267]
##    ZIPCODE DATE       TOTALPOST
##    <chr>   <date>         <dbl>
##  1 94002   2020-03-01 16308012.
##  2 94002   2020-04-01 14838170.
##  3 94002   2020-05-01 14308799.
##  4 94002   2020-06-01 13501779.
##  5 94005   2020-03-01  2777757.
##  6 94005   2020-04-01  2558167.
##  7 94005   2020-05-01  2394262.
##  8 94005   2020-06-01  2255366.
##  9 94010   2020-03-01 35412868.
## 10 94010   2020-04-01 31997446.
## # … with 1,058 more rows

To compare the two datasets, I binded one column of TOTALKBTU from one dataset with the other (and they should be the same given that the data is organized by ZIPCODE, as explained above), and created a new column called PERCENTCHANGE so that I could visualize the percent increase or decrease from 2019 to 2020. I used the percentage change formula to calculate the percent change using 2019 as the frame of reference. I then joined the geospatial data by zipcode so that I could use that to plot my data.

pge_compare <- 
  pge_pre_covid %>% 
  cbind(TOTALPOST = pge_post_covid$TOTALPOST) %>% 
  mutate(PERCENTCHANGE = (TOTALPOST-TOTALPRE)/TOTALPRE * 100) %>% 
  mutate(
    ZIPCODE = ZIPCODE %>% as.character()
  ) %>% 
  group_by(ZIPCODE) %>% 
  right_join(
    bay_zips %>% select(GEOID10),
    by = c("ZIPCODE" = "GEOID10")
  ) %>% 
  st_sf() 

pge_compare
## Simple feature collection with 1098 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.3083 ymax: 38.93713
## geographic CRS: NAD83
## # A tibble: 1,098 x 6
## # Groups:   ZIPCODE [297]
##    ZIPCODE DATE       TOTALPRE TOTALPOST PERCENTCHANGE                  geometry
##    <chr>   <date>        <dbl>     <dbl>         <dbl>        <MULTIPOLYGON [°]>
##  1 94002   2019-03-01   1.57e7 16308012.          4.07 (((-122.3359 37.50805, -…
##  2 94002   2019-04-01   1.31e7 14838170.         13.3  (((-122.3359 37.50805, -…
##  3 94002   2019-05-01   1.33e7 14308799.          7.93 (((-122.3359 37.50805, -…
##  4 94002   2019-06-01   1.29e7 13501779.          4.49 (((-122.3359 37.50805, -…
##  5 94005   2019-03-01   2.67e6  2777757.          4.07 (((-122.442 37.69435, -1…
##  6 94005   2019-04-01   2.20e6  2558167.         16.5  (((-122.442 37.69435, -1…
##  7 94005   2019-05-01   2.21e6  2394262.          8.45 (((-122.442 37.69435, -1…
##  8 94005   2019-06-01   2.09e6  2255366.          7.95 (((-122.442 37.69435, -1…
##  9 94010   2019-03-01   3.41e7 35412868.          3.78 (((-122.4126 37.58916, -…
## 10 94010   2019-04-01   2.85e7 31997446.         12.3  (((-122.4126 37.58916, -…
## # … with 1,088 more rows

I created a map to visualize which neighbourhoods experienced the most residential electricity consumption change. I set the domain as -10:20 because I noticed that most zipcodes had an increase in electricity consumption. While there are zipcodes that lie outside this domain, leaflet treats them as NA and marks them as gray on the map. I also set the palette values so that it would reflect the scale on the domain. Decreases in residential electricity result in a grayer outlook on the map while increases in residential electricity result in a darker blue shade.

res_pal <- colorNumeric(
  palette = "Blues",
  domain = 
    -10:20
)

leaflet() %>% 
  addTiles() %>% 
  addPolygons(
    data = pge_compare,
    fillColor = ~res_pal(PERCENTCHANGE),
    color = "white",
    opacity = 0.5,
    fillOpacity = 0.5,
    weight = 1,
    label = ~paste0(
      round(PERCENTCHANGE), 
      " total percentage change in ",
      ZIPCODE
    ),
    highlightOptions = highlightOptions(
      weight = 2,
      opacity = 1
    )
  ) %>% 
  addLegend(
    data = pge_compare,
    pal = res_pal,
    values = ~-10:20,
    title = "Total Percentage Change<br>in Residential Electricity<br>Usage in the months of March-June<br> between 2019 and 2020"
  )